3DS : Cherry Blossom

Lyuda Bekwinknoll, Meghana Cyanam, Theresa Marie Duenas, Kevin Kiser

Our data visualizations examine the association between age and fitness, using results from the Cherry Blossom Ten Mile Run held in Washington, DC, from 1973 to 2019.

Background:

The Credit Union Cherry Blossom (CUCB) is a non-profit organization that runs an annual 10-mile race in Washington, DC. They have asked our consulting firm to explore the relationship between fitness and aging by examining the data available on their website.

Data Extraction Method:

We created a Jupyter notebook to scrape the data from the webpage: <https://www.cballtimeresults.org/performance-search/?eventType=10M&year=1973&division=M&page=1>. The Python library "requests" was used to connect to the URL, and "BeautifulSoup" was used to parse the HTML and extract the data. We wrote a function that iterated through the specified variables (section, division, year, and page) to ensure all the data was collected. The CUCB website contains data for the following columns: Name, PiD/TiD, PiS/TiS, Age, Time, Pace, Division, and Home Town. The data was scraped for men and women between 1973 and 2019. 1973 was the first year the event was held, and 2019 was the last year before organizers canceled the event for the first time, in 2020, due to COVID-19.
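The iteration described above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the query parameters are taken from the URL quoted above, but the function names, the "W" division value for women, and the stop-at-first-empty-page rule are assumptions, and `fetch_rows` is a hypothetical stand-in for the "requests" and "BeautifulSoup" step.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.cballtimeresults.org/performance-search/"


def build_url(year, division, page):
    """Build one search-results URL in the same shape as the URL quoted above."""
    query = urlencode(
        {"eventType": "10M", "year": year, "division": division, "page": page}
    )
    return f"{BASE_URL}?{query}"


def scrape_all(fetch_rows, years=range(1973, 2020), divisions=("M", "W")):
    """Collect rows for every year and division, paging until a page is empty.

    fetch_rows(url) is a hypothetical callable that downloads one page and
    returns its table rows as a list (empty when there are no more results).
    """
    rows = []
    for year in years:
        for division in divisions:
            page = 1
            while True:
                page_rows = fetch_rows(build_url(year, division, page))
                if not page_rows:  # assumed stop condition: first empty page
                    break
                rows.extend(page_rows)
                page += 1
    return rows
```

Passing the fetch step in as a function keeps the looping logic separate from the network and HTML parsing, so the paging can be checked with a fake fetcher before any real request is made.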

We incorporated weather data from the National Oceanic and Atmospheric Administration (NOAA) and the National Centers for Environmental Information. The closest weather station to the course was at Ronald Reagan Washington National Airport in Arlington, VA, situated on the banks of the Potomac across from the National Mall, where most of the race takes place. From that data set, we added precipitation and the minimum and maximum temperatures for each day the race took place. We manually recorded each event date from the Rite of Spring PDF, which details the history of the race, and used those dates to join only the relevant weather records.
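The date-based join can be illustrated with a small sketch. It assumes the race dates and the daily weather records are both keyed by calendar date. The function name and the dictionary layout are hypothetical, and the values shown are placeholders, not real race dates or NOAA readings.

```python
from datetime import date


def join_weather(race_dates, weather_by_date):
    """Keep only weather observations that fall on a race date (an inner join).

    race_dates:      {year: date}, recorded by hand from the race history.
    weather_by_date: {date: {"PRCP": ..., "TMIN": ..., "TMAX": ...}}, daily records.
    """
    joined = {}
    for year, race_day in race_dates.items():
        obs = weather_by_date.get(race_day)
        if obs is not None:  # years without a matching observation are skipped
            joined[year] = {"Date": race_day, **obs}
    return joined


# Hypothetical placeholder values, not real race dates or NOAA readings.
race_dates = {2001: date(2001, 4, 8)}
weather_by_date = {
    date(2001, 4, 7): {"PRCP": 0.0, "TMIN": 40, "TMAX": 60},
    date(2001, 4, 8): {"PRCP": 0.1, "TMIN": 45, "TMAX": 65},
}
race_weather = join_weather(race_dates, weather_by_date)
```

Only the observation for the race day survives the join; the record for the day before is dropped.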

Describing our Data

| Variable Names | Data Types | Variable Descriptions |
|---|---|---|
| Year | Integer | Year the race was held. |
| Name | Character | An individual's first and last name, in varying formats. Most names in the CUCB website results also list an 'M', 'F', or 'W' in parentheses for the individual's sex. Example: James Yenckel (M) |
| Age | Integer | Age of runner at time of race. |
| Time | Time/Numeric | Time each runner took to complete the 10-mile race, in hr:min:sec format. |
| Division | Character | There are 28 divisions, 14 for each sex. Four of them span 20-year age ranges, while the rest span 5-year ranges. Each division is an alphanumeric code separating competitors by sex and age. Example: W2529 (25 to 29-year-old women) |
| pos_by_sex | Integer | The place in which a runner finished, by sex, per year. |
| total_by_sex | Integer | The total number of competitors of a given sex, per year. |
| Sex | Character | Sex of runner. |
| PRCP | Numeric | Daily precipitation in inches, recorded to hundredths of an inch, collected by NOAA. |
| TMIN | Integer | Minimum daily temperature in degrees Fahrenheit, collected by NOAA. |
| TMAX | Integer | Maximum daily temperature in degrees Fahrenheit, collected by NOAA. |
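The Name and Division formats described above lend themselves to simple pattern matching. The following is a minimal sketch based only on the two examples given: the helper names are hypothetical, and it assumes the division code is always one sex letter (M or W) followed by two two-digit ages, which may not hold for every division on the site.

```python
import re

# Assumed formats, inferred from the two examples in the table above.
NAME_RE = re.compile(r"^(?P<name>.*?)\s*\((?P<sex>[MFW])\)\s*$")
DIVISION_RE = re.compile(r"^(?P<sex>[MW])(?P<low>\d{2})(?P<high>\d{2})$")


def split_name_sex(raw_name):
    """Split 'James Yenckel (M)' into ('James Yenckel', 'M').

    Returns (stripped name, None) when no sex marker is present.
    """
    match = NAME_RE.match(raw_name)
    if match is None:
        return raw_name.strip(), None
    return match.group("name"), match.group("sex")


def parse_division(code):
    """Split 'W2529' into ('W', 25, 29); return None for other formats."""
    match = DIVISION_RE.match(code)
    if match is None:
        return None
    return match.group("sex"), int(match.group("low")), int(match.group("high"))
```

Returning `None` for unrecognized values, rather than raising an error, makes it easy to count how many rows do not follow the expected format before deciding how to handle them.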

Dataset Overview:

The original data set has 347,402 rows and 17 columns. After cleaning, we ended up with 339,934 rows and 11 columns. We omitted 7,468 rows because they had missing values for the time and/or age variables. The variables and data we excluded from our analysis and visualizations are described below:

| What was excluded | Reason for exclusion |
|---|---|
| Hometown | The entries contained many missing values and inconsistencies. A few individuals reported their hometown differently each time they ran the race, or reported several at once, so no accurate conclusions can be drawn from this variable. It also does not serve our main objective. |
| Distance | This column states the length of the race, which is 10 miles. Since we already know the data comes from the 10-mile race, a column that repeats this for every row is redundant. |
| Date | The race happens at roughly the same time each spring, and the specific dates would not affect our data question. |
| pos_by_div | This variable gives the position in which a runner finished within their assigned division for a given year. There are 28 divisions, 14 per sex, each covering an age range of either 20 years or 5 years. We excluded this variable because similar information can be obtained from the pos_by_sex variable, and we can define our own age ranges. |
| total_by_division | This variable gives the total number of individuals in each division for a given year. The divisions are the same as described above, and the variable is excluded for the same reason. |
| Pace | This variable gives each runner's pace per mile. We excluded it because it was not reading in correctly, and because the pace can be calculated directly from the Time variable by dividing it by 10 (the total miles in the race). |
| Data from the year 1977 | A large block of finishing times is missing from the middle of the 1977 results, and we have no information about what caused the gap. Keeping this year could bias our analysis, because so many points are missing from the central part of the time distribution, which would make our interpretations inaccurate. |
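As noted for the Pace variable, pace per mile follows directly from the finishing time. A minimal sketch of that calculation, assuming times are strings in hr:min:sec format (the function names are hypothetical):

```python
RACE_MILES = 10  # length of the race, as stated above


def time_to_seconds(hms):
    """Convert an 'hr:min:sec' string to a total number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def pace_per_mile(hms, miles=RACE_MILES):
    """Return the pace as 'mm:ss' per mile, rounded to the nearest second."""
    pace_seconds = round(time_to_seconds(hms) / miles)
    minutes, seconds = divmod(pace_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"
```

For example, a finishing time of 01:30:50 is 5,450 seconds, or 545 seconds per mile, so `pace_per_mile("01:30:50")` returns `"09:05"`.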

Loading and Cleaning Data

Summary Statistics:

Year, Age, Time, and Sex are the main variables to focus on.

Checklist for this section:

summary stats: mean, median, mode, range, sd, percentiles, distributions by sex variable, etc.

mention how many women and how many men in each year and overall

summary(df)
##       Year          Name                Age            Time         
##  Min.   :1974   Length:339214      Min.   : 8.0   Min.   :00:43:20  
##  1st Qu.:2001   Class :character   1st Qu.:29.0   1st Qu.:01:19:35  
##  Median :2009   Mode  :character   Median :35.0   Median :01:30:50  
##  Mean   :2006                      Mean   :36.6   Mean   :01:31:25  
##  3rd Qu.:2015                      3rd Qu.:43.0   3rd Qu.:01:42:22  
##  Max.   :2019                      Max.   :87.0   Max.   :02:20:00  
##                                                                     
##    Division           pos_by_sex     total_by_sex       Sex           
##  Length:339214      Min.   :    1   Min.   :   27   Length:339214     
##  Class :character   1st Qu.: 1109   1st Qu.: 3513   Class :character  
##  Mode  :character   Median : 2445   Median : 6792   Mode  :character  
##                     Mean   : 3134   Mean   : 6298                     
##                     3rd Qu.: 4739   3rd Qu.: 9030                     
##                     Max.   :11042   Max.   :11042                     
##                     NA's   :6       NA's   :6                         
##       PRCP             TMAX           TMIN      
##  Min.   :0.0000   Min.   :44.0   Min.   :32.00  
##  1st Qu.:0.0000   1st Qu.:56.0   1st Qu.:39.00  
##  Median :0.0000   Median :64.0   Median :43.00  
##  Mean   :0.0538   Mean   :63.3   Mean   :43.11  
##  3rd Qu.:0.0500   3rd Qu.:70.0   3rd Qu.:47.00  
##  Max.   :0.9300   Max.   :84.0   Max.   :58.00  
## 
library(ggplot2)   # ggplot(), aes(), scale_*(), labs()
library(ggridges)  # geom_density_ridges_gradient()

# Ridgeline density plot of runner age for each race year
plot_age_dist <- ggplot(df, aes(x = Age, y = as.factor(Year))) +
  geom_density_ridges_gradient(
    # after_stat(x) replaces the deprecated ..x.. notation (ggplot2 >= 3.4.0)
    aes(fill = after_stat(x)), scale = 3, size = 0.3
  ) +
  scale_fill_gradientn(
    colours = c("#0D0887FF", "#CC4678FF", "#F0F921FF"),
    name = "Age"
  ) +
  labs(title = "Age Distribution by Year", y = "")

plot_age_dist
## Picking joint bandwidth of 1.6

# Axis and legend ticks: five evenly spaced positions between the fastest and
# slowest finishing times, with approximate "hh:mm" labels
custom_ticks <- c("00:43", "01:07", "01:30", "01:56", "02:20")
tick_positions <- seq(min(df$Time), max(df$Time), length.out = length(custom_ticks))
tick_labels <- times(tick_positions)  # times() comes from the chron package

# Ridgeline density plot of finishing time for each race year
plot_time_dist <- ggplot(df, aes(x = Time, y = as.factor(Year))) +
  geom_density_ridges_gradient(
    # after_stat(x) replaces the deprecated ..x.. notation (ggplot2 >= 3.4.0)
    aes(fill = after_stat(x)), scale = 3, size = 0.3
  ) +
  scale_fill_gradientn(
    colours = c("red", "purple", "blue"),
    name = "Time to finish",
    breaks = tick_positions,
    labels = custom_ticks
  ) +
  scale_x_continuous(labels = tick_labels, breaks = tick_positions) +
  labs(title = "Time Distribution by Year", y = "")

plot_time_dist
## Picking joint bandwidth of 0.00152

# Scatterplot function: takes a year and the data, and returns that year's
# scatterplot of finishing time against age with a linear trend line

scat_plot <- function(curr_year, df) {
  # Subset the data to the given year
  sub_data <- df %>% filter(Year == curr_year)

  # Scatterplot using ggplot2
  ggplot(sub_data, aes(x = Age, y = as.numeric(Time))) +
    geom_point(col = "blue", shape = 1) +
    geom_smooth(method = "lm", se = FALSE, col = "red") +
    labs(title = as.character(curr_year), x = "Age (years)", y = "Time (hh:mm)") +
    scale_y_continuous(labels = custom_ticks, breaks = tick_positions)
}

# Loop over every year present in the data and print one plot per year.
# unique(df$Year) skips 1977 and will still work if we remove other years.
# par(mfrow = ...) only affects base graphics, not ggplot2, so it is not
# used here; each plot is drawn on its own page.
for (curr_year in unique(df$Year)) {
  print(scat_plot(curr_year, df))
}
## `geom_smooth()` using formula = 'y ~ x'


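# Finishing time against age, one panel per sex. Time is converted to numeric
# so that ggplot2 does not have to guess a scale for the <times> class.
df %>%
  ggplot(aes(x = Age, y = as.numeric(Time))) +
  geom_point() +
  facet_wrap(~Sex) +
  scale_y_continuous(labels = custom_ticks, breaks = tick_positions) +
  labs(x = "Age (years)", y = "Time (hh:mm)")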